ABSTRACT


The ability to identify out-of-distribution (OOD) data is a critical component in deploying robust machine learning
models in real-world applications [1-9]. OOD detection aims to identify instances that deviate significantly from
the training distribution, ensuring the reliability of model predictions and minimizing the risk of erroneous outputs.
This capability is particularly crucial in safety-critical domains such as autonomous driving [10-16], healthcare
[17-22], and security systems, where the presence of unfamiliar data can lead to catastrophic failures.


Various approaches have been proposed to address the problem of OOD detection, ranging from statistical
techniques to deep learning-based methods. Traditional methods often rely on simple feature extraction and
anomaly detection algorithms, which may be inadequate for capturing complex data distributions.


Recently, the Contrastive Language-Image Pretraining (CLIP) model has emerged as a powerful backbone for
feature extraction. CLIP leverages extensive internet data to learn rich, multimodal representations of images and
text, demonstrating impressive zero-shot learning capabilities and effectiveness in various tasks without the need
for task-specific fine-tuning.


Similarly, diffusion models have gained attention for their ability to generate high-quality images through a
denoising process. These models learn the data distribution by progressively removing noise from a corrupted
version of the image, effectively capturing the underlying data manifold. The prior knowledge embedded in
diffusion models can be instrumental in reconstructing images and identifying anomalies. This prior can also
benefit 3D vision tasks [23-27] and scene understanding [28-33].
1.1. Study Objectives 
The primary objectives of this study are as follows: 
(i) To develop a novel OOD detection method that integrates the generative capabilities of diffusion models with
the robust feature extraction of CLIP.

(ii) To evaluate the effectiveness of the proposed method in accurately reconstructing images and identifying OOD
instances by analyzing the discrepancy between original and reconstructed images.

(iii) To assess the practicality of the proposed method in scenarios without requiring class-specific labeled
in-distribution (ID) data.

(iv) To conduct extensive experiments on several benchmark datasets to validate the robustness and efficacy of the
proposed method.

(v) To compare the performance of the proposed method with existing OOD detection techniques, highlighting
improvements in detection accuracy and scalability.

(vi) To leverage the zero-shot classification capability of CLIP for classification tasks, enabling the use of large
in-distribution datasets without the need for labeled OOD data.


In this paper, we propose a novel approach to OOD detection by exploiting the diffusion prior. The core insight of
our approach is that if a model can accurately reconstruct an image, the image is likely part of the distribution
the model has learned. Conversely, poor reconstruction suggests that the image is out-of-distribution. Our method
uses the CLIP model to encode the image and passes its features as conditional input to the diffusion model. By
measuring the discrepancy between the reconstructed image and the original input, we can determine whether an
image is OOD. This approach rests on the assumption that the model can only accurately reconstruct images of
classes it has encountered during training, leveraging both the image input and its feature representation.


Additionally, for classification purposes, we utilize the zero-shot classification capability of CLIP, allowing us to
classify images without fine-tuning the model. This is particularly advantageous as it enables the use of large
amounts of in-distribution data without requiring labeled OOD data.
Our main contributions can be summarized as follows: 
• We propose a novel OOD detection method based on the integration of CLIP and diffusion models.

• We conduct extensive experiments on multiple benchmarks, demonstrating the robustness and efficacy of our
method in OOD detection.

2. Related Works

(a) Out-of-Distribution Detection 

Several methods have been introduced to address the complex problem of OOD detection [34-41]. A common
strategy involves leveraging uncertainty estimation techniques, such as Bayesian modeling [42], to assess
prediction uncertainty and identify OOD samples. Prominent techniques in this domain include Maximum Softmax
Probability (MSP), which uses the maximum softmax output as a confidence measure [43], the Mahalanobis distance
[44], and Markov Chain Monte Carlo methods, which facilitate sampling from high-dimensional distributions [45].
Ensemble models are also widely acknowledged for enhancing the robustness and performance of machine
learning systems, including OOD detection [46]. In OOD detection, ensemble methods integrate multiple base
models for predictions, fitting into both probabilistic and uncertainty-based frameworks.
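As a point of reference for the confidence-based baselines mentioned above, the MSP score [43] can be sketched in a few lines. This is only an illustrative sketch in plain Python: the helper names, the toy logits, and the threshold of 0.5 are assumptions made for the example and are not taken from any of the cited works.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability; higher means the classifier is more confident."""
    return max(softmax(logits))


def is_ood_msp(logits, threshold=0.5):
    """Flag a sample as OOD when its MSP falls below a threshold.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return msp_score(logits) < threshold


# Toy logits (assumed values): one peaked prediction, one nearly uniform.
confident_logits = [8.0, 0.5, -1.0]
uncertain_logits = [0.2, 0.1, 0.15]

for name, logits in [("confident", confident_logits), ("uncertain", uncertain_logits)]:
    print(name, round(msp_score(logits), 4), "OOD" if is_ood_msp(logits) else "ID")
```

In practice the threshold would be calibrated on held-out in-distribution data rather than fixed in advance.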


Supervised methods have shown some efficacy in reducing the incidence of erroneously high-confidence
predictions on OOD inputs [47]; however, they are constrained by the necessity of labeled OOD data for training.
Common unsupervised approaches include density estimation techniques [48]. Recent research indicates that
augmentation and adversarial perturbation can improve OOD detection performance [49]. A key strength of our
proposed OOD detection method is that it does not require specific class labels for training the diffusion model.
Instead, it only requires in-distribution samples to learn the distribution, enabling it to determine whether a sample
is in-distribution or OOD during testing.
(b) Pre-trained Vision-Language Models 


Interpreting the semantic information within images remains a significant challenge in computer vision [50-60].
The emergence of Transformers [61] has had a great impact not only on natural language processing [62-74] but
also on vision-related tasks [75], paving the way for the introduction of CLIP [76], a powerful pre-trained
vision-language model. By combining contrastive learning with large-scale models and datasets [77], CLIP
uses image-text pairs for self-supervised training. This strategy trains the model to align visual and textual
representations within a shared latent space, facilitating robust feature extraction and zero-shot learning
capabilities.
(c) Diffusion models 


Denoising diffusion probabilistic models, commonly known as diffusion models [78], have gained popularity as a
notable class of generative models, recognized for their exceptional synthesis quality and controllability. The
fundamental principle of these models involves training a denoising autoencoder to approximate the reverse of a
Markovian diffusion process [79]. By leveraging generative training on large-scale datasets with image-text pairs,
such as LAION-5B [77], diffusion models develop the ability to produce high-quality images featuring diverse
content and coherent structures. Recently, a controllable architecture called ControlNet [80] has been introduced,
enabling the addition of spatial controls, such as depth maps and human poses, to pre-trained diffusion models,
thereby expanding their applicability to controlled image generation.
Figure 1. Architecture of CLIP model
3.1. CLIP Model


CLIP is a multi-modal vision and language model that has demonstrated impressive results in image-text similarity
and zero-shot image classification, leveraging extensive training data and large-scale models. CLIP consists of an
image encoder, such as CNN-based or Transformer-like models, and a causal language model to obtain text
features. During the pre-training phase, CLIP uses large-scale image-text pairs for self-supervised contrastive
learning, aligning images and texts into the same latent space.


As shown in Figure 1, in zero-shot image classification tasks, given $M$ class labels (e.g., "cat", "dog"), CLIP
incorporates these class labels into pre-designed hard (non-learnable) text prompts, such as "a photo of a
[class]", forming a prompt set like "a photo of a cat", "a photo of a dog", etc. These prompts are then fed into the
text encoder to obtain $M$ text features $T_i$, where $i \in \{1, 2, \ldots, M\}$. The test image is input into the
image encoder to obtain an image feature $I_r$. The cosine similarity is calculated between the normalized image
feature and all normalized text features, formally $\mathrm{Sim}(T_i, I_r) = T_i \cdot I_r$, and the class whose
text feature $T_i$ has the highest similarity to $I_r$ is taken as the image's category.
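The zero-shot decision rule described above can be illustrated with a small sketch. It does not call CLIP: the three-dimensional feature vectors are made-up stand-ins for the outputs of the text and image encoders, and the function names are hypothetical. Only the prompt template and the cosine-similarity argmax follow the description in the text.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(image_feature, text_features, class_names):
    """Return the class whose text feature is most similar to the image feature."""
    sims = [cosine_similarity(t, image_feature) for t in text_features]
    best = max(range(len(sims)), key=sims.__getitem__)
    return class_names[best], sims


class_names = ["cat", "dog"]
# Hard (fixed) prompts built from the class labels.
prompts = [f"a photo of a {name}" for name in class_names]

# Toy features (assumed values) standing in for encoder outputs of the
# prompts and of a test image.
text_features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
image_feature = [0.8, 0.3, 0.1]

label, sims = zero_shot_classify(image_feature, text_features, class_names)
print(prompts)
print(label, [round(s, 3) for s in sims])
```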
3.2. Diffusion U-Net 


Diffusion Models are generative models used to generate data similar to the training data. Fundamentally,
Diffusion Models work by progressively adding Gaussian noise to training data and then learning to recover the
data by reversing this noising process.


Diffusion models achieve high controllability through effective cross-attention layers in the denoising U-Net,
facilitating interactions between image features and various conditions. ControlNet, a neural network that enhances
image generation in Stable Diffusion by adding extra conditions, allows users to control the images generated more
precisely. ControlNet enhances the fine-grained spatial control on latent diffusion models (LDM) by leveraging a
trainable copy of the encoding layers in the denoising U-Net as a strong backbone for learning diverse conditional
controls.


During the training of the ControlNet framework, images are first projected to latent representations $z_0$ by a
trained VQGAN consisting of an encoder ($\mathcal{E}$) and a decoder ($\mathcal{D}$). Denoting $z_s$ as the noisy
image at the $s$-th timestep, it is produced by:

$$z_s = \sqrt{\bar{\alpha}_s}\, z_0 + \sqrt{1 - \bar{\alpha}_s}\, \epsilon, \qquad (1)$$

where $\bar{\alpha}_s = \prod_{i=1}^{s} \alpha_i$ and $\epsilon \sim \mathcal{N}(0, I)$. By utilizing fine-grained
conditions, ControlNet achieves controllable human image generation with various conditions based on the semantic
information of the input.
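The forward noising step of Eq. (1), in its standard form $z_s = \sqrt{\bar{\alpha}_s}\, z_0 + \sqrt{1 - \bar{\alpha}_s}\, \epsilon$, can be sketched as follows. The linear noise schedule, its endpoints, the number of steps, and the toy latent are assumptions chosen for illustration; the text does not specify the schedule actually used.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_s = prod_{i<=s} alpha_i, with alpha_i = 1 - beta_i.

    Uses a linear beta schedule (an assumed choice); requires num_steps >= 2.
    """
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, alpha_bar_s, rng):
    """Sample z_s = sqrt(alpha_bar_s) * z0 + sqrt(1 - alpha_bar_s) * eps, eps ~ N(0, I)."""
    signal_scale = math.sqrt(alpha_bar_s)
    noise_scale = math.sqrt(1.0 - alpha_bar_s)
    return [signal_scale * z + noise_scale * rng.gauss(0.0, 1.0) for z in z0]


rng = random.Random(0)            # fixed seed for repeatability
alpha_bars = make_alpha_bars(1000)
z0 = [0.5, -1.0, 2.0, 0.0]        # toy latent vector (assumed values)

z_early = add_noise(z0, alpha_bars[0], rng)    # first timestep: almost clean
z_late = add_noise(z0, alpha_bars[-1], rng)    # last timestep: almost pure noise
print(z_early)
print(z_late)
```

Because $\bar{\alpha}_s$ decreases monotonically with $s$, early timesteps keep most of the signal while late timesteps are dominated by noise.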
3.3. Proposed Out-of-Distribution Detection Method 


The features extracted from the CLIP model can be highly beneficial for classifying input images and
distinguishing between in-distribution (ID) and out-of-distribution (OOD) samples.
We fine-tune a pre-trained denoising U-Net, guided by condition injection from features extracted by CLIP. The
denoising U-Net is designed to reconstruct input images, and we use the reconstruction error to generate
precision-recall curves for OOD detection. To prepare the input images, we map the cropped image I from pixel
space to a latent representation using the image encoder of the CLIP model. We then feed the image into the U-Net
with guidance extracted from CLIP. The encoder takes grayscale input images of size 128 x 128 x 1 and
progressively reduces the spatial dimensions while increasing the number of channels, culminating in a bottleneck
layer. The decoder then upscales and reconstructs the original input image through transposed convolutions and
activations.


During training, the model is optimized to minimize the Mean Squared Error (MSE) loss between the reconstructed
images and the original input. During inference, the threshold for distinguishing between in-distribution and
out-of-distribution samples is set to the maximum reconstruction error of the in-distribution samples. With this
approach, any sample with a reconstruction error above the threshold is classified as OOD, while samples at or
below the threshold are considered in-distribution.
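The thresholding rule can be sketched as follows. This is a minimal illustration, not the paper's implementation: the short vectors stand in for flattened images and their reconstructions by the fine-tuned model, and all function names and values are hypothetical.

```python
def mse(original, reconstructed):
    """Mean squared error between an input and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def fit_threshold(id_errors):
    """Threshold = maximum reconstruction error observed on in-distribution samples."""
    return max(id_errors)


def detect_ood(error, threshold):
    """A sample is flagged as OOD when its error exceeds the threshold."""
    return error > threshold


# Toy (input, reconstruction) pairs for in-distribution samples (assumed values):
# the reconstructions are close to the inputs.
id_pairs = [
    ([0.1, 0.2, 0.3], [0.1, 0.25, 0.3]),
    ([0.5, 0.5, 0.5], [0.45, 0.5, 0.55]),
]
id_errors = [mse(x, x_hat) for x, x_hat in id_pairs]
threshold = fit_threshold(id_errors)

# A test sample that is reconstructed poorly.
test_error = mse([0.9, 0.1, 0.4], [0.2, 0.6, 0.4])
print(threshold, test_error, detect_ood(test_error, threshold))
```

By construction, no in-distribution sample used to fit the threshold is flagged as OOD. Since the maximum is sensitive to a single badly reconstructed in-distribution sample, a high percentile of the in-distribution errors is a possible alternative, though the text uses the maximum.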
Figure 2. Architecture of the proposed method
4. Experiments
4.1. Experimental Details 
4.1.1. Datasets 


To evaluate the efficacy of our proposed OOD detection method, we conducted extensive experiments using
established benchmarks. We utilized the ImageNet-1K [81] dataset with 1,000 classes as the in-distribution (ID)
dataset. For out-of-distribution (OOD) datasets, we selected subsets from Texture [82], iNaturalist [83], Places
[84], and SUN [85], ensuring that the concepts in these datasets do not overlap with ImageNet-1K. Specifically, the
entire Texture dataset was used for evaluation. Additionally, 110 plant classes not present in ImageNet-1K were
selected from iNaturalist, 50 categories not present in ImageNet-1K were selected from Places, and 50 unique
nature-related concepts were selected from SUN.
4.1.2. Implementation Details


We employed the CLIP model based on CLIP-B/16, pre-trained from OpenCLIP [86]. For the denoising U-Net, we
used Stable Diffusion V1-5 with ControlNet, pre-trained for image generation. The model was fine-tuned using
ImageNet-1K samples for 10 epochs. During fine-tuning, the Mean Squared Error (MSE) loss was minimized
between the reconstructed images and the original inputs to enhance the model's reconstruction capabilities.
4.2. Comparison with Existing Models 


The results of OOD detection on the benchmark datasets are summarized in Table 1. Our proposed method
achieves superior or comparable performance across individual OOD datasets and in the averaged results.
Compared with zero-shot methods, our approach improves on the best competing method, CLIPN [87], by about
2.4 percentage points in average FPR95 (28.75 vs. 31.1), even though CLIPN requires an additional large external
dataset to train an additional negative text encoder. Although our method is significantly more lightweight than
CLIPN in model size, it matches or outperforms CLIPN on most datasets; CLIPN remains ahead only in AUROC
on Places and in FPR95 on Texture. Adapted post-hoc methods generally do not leverage CLIP's capabilities well
and thus perform less effectively.


Table 1. Performance metrics (FPR95 and AUROC, in %) across various datasets. Lower FPR95 (↓) and higher AUROC (↑) are better.

Method           iNaturalist       SUN               Places            Texture           Average
                 FPR95↓  AUROC↑    FPR95↓  AUROC↑    FPR95↓  AUROC↑    FPR95↓  AUROC↑    FPR95↓  AUROC↑

Zero-shot methods
MCM [88]         30.94   94.61     37.67   92.56     44.76   89.76     57.91   86.1      42.82   90.76
GL-MCM [89]      15.18   96.71     30.42   93.09     38.85   89.9      57.93   83.63     35.47   90.83
CLIPN [87]       23.94   95.27     26.17   93.92     33.45   92.28     40.83   90.93     31.1    93.1

CLIP-based post-hoc methods
MSP [43]         74.57   77.74     76.95   73.97     79.12   72.18     73.66   74.84     76.22   74.98
MaxLogit [90]    60.88   88.03     44.83   91.16     55.54   87.45     48.72   88.63     52.49   88.82
ODIN [91]        30.22   94.65     54.04   87.17     55.06   85.54     51.67   87.85     47.75   88.8
ViM [92]         32.19   93.16     54.01   87.19     60.67   83.75     53.94   87.18     50.2    87.82
KNN [93]         29.17   94.52     35.62   92.67     39.61   91.02     64.35   85.67     42.19   90.97

Prompt learning methods
CoOp [94]        29.81   93.77     40.83   93.29     40.11   90.58     45      89.47     51.68   91.78

Proposed method
Ours             15.03   96.45     24.95   94.59     33.17   90.83     41.85   91.02     28.75   93.25


Our method also substantially surpasses the prompt learning-based method CoOp [94], reducing the average FPR95
by about 23 percentage points (28.75 vs. 51.68). This indicates that the learned diffusion model provides
informative knowledge for separating OOD data, which is lacking in the competing methods, significantly reducing
detection errors.
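For readers less familiar with the two metrics reported in Table 1, the sketch below shows one common way to compute them. It assumes the usual convention in which in-distribution samples are the positive class and a higher score means "more in-distribution", so a reconstruction error is negated to obtain a score. The error values are invented for illustration and are unrelated to the reported results.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when the threshold retains `tpr` of ID samples.

    Higher score means more ID-like.
    """
    ranked = sorted(id_scores, reverse=True)
    keep = math.ceil(tpr * len(ranked))       # number of ID samples that must be accepted
    threshold = ranked[keep - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores above a random OOD sample (ties count half)."""
    wins = 0.0
    for i in id_scores:
        for o in ood_scores:
            if i > o:
                wins += 1.0
            elif i == o:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy reconstruction errors (assumed values): ID errors are mostly small,
# OOD errors are mostly large, with some overlap.
id_errors = [0.01 * k for k in range(1, 21)]
ood_errors = [0.155, 0.185, 0.25, 0.30, 0.40]

# Negate errors so that a higher score means "more in-distribution".
id_scores = [-e for e in id_errors]
ood_scores = [-e for e in ood_errors]

fpr95 = fpr_at_tpr(id_scores, ood_scores)
auc = auroc(id_scores, ood_scores)
print(f"FPR95 = {100 * fpr95:.2f}%, AUROC = {100 * auc:.2f}%")
```

The pairwise AUROC computation is quadratic in the number of samples, which is fine for a sketch; a rank-based formulation would be preferable for full benchmark sets.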
5. Conclusion

In this paper, we introduced a novel approach to out-of-distribution (OOD) detection by combining the feature
extraction capabilities of CLIP with the generative power of diffusion models. Our method involves encoding
images with CLIP and using these features as conditional inputs for a diffusion model to reconstruct the images.
The discrepancy between the original and reconstructed images serves as a robust indicator for OOD detection.


Our approach offers several advantages over existing methods. Firstly, it does not require labeled OOD data,
making it more practical and scalable for real-world applications. By leveraging only in-distribution samples for
training, our method effectively discerns between in-distribution and OOD samples during testing. Secondly, the
integration of CLIP's zero-shot classification capability enhances the versatility of our method, allowing for
effective image classification without the need for model fine-tuning.


We conducted extensive experiments on multiple benchmark datasets, including ImageNet-1K, Texture,
iNaturalist, Places, and SUN. The results demonstrate that our method achieves significant improvements in
detection accuracy, with substantial reductions in false positive rates and enhanced detection metrics across diverse
datasets. These findings underscore the potential of integrating pre-trained models to enhance the reliability of
OOD detection, paving the way for the deployment of more dependable machine learning systems.
Future work could explore further enhancements to our method, such as: 

(i) Incorporating additional types of pre-trained models to further enhance detection accuracy.

(ii) Refining the reconstruction process to improve the robustness and accuracy of OOD detection.

(iii) Applying the approach to other domains beyond image data, such as natural language processing or audio data,
to broaden its applicability.

(iv) Investigating the impact of different types of noise in the diffusion process to achieve more robust OOD
detection.

(v) Integrating the method with real-time systems to evaluate performance in dynamic and unpredictable
environments.

(vi) Extending the framework to handle multi-modal data inputs simultaneously, enhancing its capability to detect
OOD instances across various data types.